Data Joins + Transformations

Monday, April 24

Today we will…

Lambda Functions

We know how to use across() to apply built-in functions across many columns.

cereal |> 
  summarize(across(calories:potass, mean))
  calories  protein      fat   sodium    fiber   carbo   sugars   potass
1 106.8831 2.545455 1.012987 159.6753 2.151948 14.5974 6.922078 96.07792

But what if we want to pass additional arguments to a function?

  • E.g., we need to drop missing data (na.rm = T).
  • E.g., we need to calculate a trimmed mean (trim = 0.1).

Lambda Functions

When you want to specify additional arguments, you need to use a lambda function rather than just the function name:

~ <FUN_NAME>(.x, <ARGS>)

Calculate trimmed means, trimming 10% of the data on each end.

cereal |> 
  summarize(across(calories:potass, ~ mean(.x, trim = 0.1)))
  calories protein       fat  sodium    fiber    carbo   sugars   potass
1 107.3016 2.47619 0.8888889 163.254 1.773016 14.71429 6.920635 86.19048

Data Layouts

Tidy Data

Tidy data…

  • is rectangular.
  • has observations as rows and variables as columns.
  • has different formats for different tasks.

R4DS

Consequences of Messy Data

Illustration by Allison Horst

  • Tidy: use the same tools in similar ways for different datasets.
  • Messy: create unique tools that are difficult to generalize.

Creating Tidy Data

We may need to transform our data to turn it into the version of tidy that is best for a task at hand.

Illustration by Allison Horst

Creating Tidy Data

We want to look at mean cereal nutrients based on shelf.

  • The data are in a wide format – a separate column for each nutrient.
  • Transforming the data will make plotting easier.
library(liver)
data(cereal)
head(cereal)
name manuf type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
100% Bran N cold 70 4 1 130 10.0 5.0 6 280 25 3 1 0.33 68.40297
100% Natural Bran Q cold 120 3 5 15 2.0 8.0 8 135 0 3 1 1.00 33.98368
All-Bran K cold 70 4 1 260 9.0 7.0 5 320 25 3 1 0.33 59.42551
All-Bran with Extra Fiber K cold 50 4 0 140 14.0 8.0 0 330 25 3 1 0.50 93.70491
Almond Delight R cold 110 2 2 200 1.0 14.0 8 -1 25 3 1 0.75 34.38484
Apple Cinnamon Cheerios G cold 110 2 2 180 1.5 10.5 10 70 25 1 1 0.75 29.50954

Creating Tidy Data

Code
cereal_wide <- cereal |> 
  group_by(shelf) |> 
  summarise(across(calories:vitamins, mean))
shelf calories protein fat sodium fiber carbo sugars potass vitamins
1 102.5000 2.650000 0.60 176.2500 1.6850000 15.80000 4.800000 75.50000 20.00000
2 109.5238 1.904762 1.00 145.7143 0.9047619 13.61905 9.619048 57.80952 23.80952
3 107.7778 2.861111 1.25 158.6111 3.1388889 14.50000 6.527778 129.83333 35.41667
Code
my_colors <- c("calories_col" = "steelblue", "sugars_col" = "orange3")

cereal_wide |> 
  ggplot() +
  geom_point(aes(x = shelf, y = calories, color = "calories_col")) +
  geom_line(aes(x = shelf, y = calories, color = "calories_col")) + 
  geom_point(aes(x = shelf, y = sugars, color = "sugars_col")) +
  geom_line(aes(x = shelf, y = sugars, color = "sugars_col")) +
  scale_color_manual(values = my_colors, labels = names(my_colors)) +
  labs(x = "Shelf", y = "", subtitle = "Mean Amount", color = "Nutrient")

Code
cereal_long<- cereal |> 
  pivot_longer(cols = calories:vitamins,
               names_to = "Nutrient",
               values_to = "Amount") |> 
  group_by(shelf, Nutrient) |> 
  summarise(mean_amount = mean(Amount))
shelf Nutrient mean_amount
1 calories 102.5000000
1 carbo 15.8000000
1 fat 0.6000000
1 fiber 1.6850000
1 potass 75.5000000
1 protein 2.6500000
1 sodium 176.2500000
1 sugars 4.8000000
1 vitamins 20.0000000
2 calories 109.5238095
2 carbo 13.6190476
2 fat 1.0000000
2 fiber 0.9047619
2 potass 57.8095238
2 protein 1.9047619
2 sodium 145.7142857
2 sugars 9.6190476
2 vitamins 23.8095238
3 calories 107.7777778
3 carbo 14.5000000
3 fat 1.2500000
3 fiber 3.1388889
3 potass 129.8333333
3 protein 2.8611111
3 sodium 158.6111111
3 sugars 6.5277778
3 vitamins 35.4166667
Code
cereal_long |> 
  ggplot(aes(x = shelf, 
             y = mean_amount, 
             color = Nutrient)) +
  geom_point() +
  geom_line() +
  labs(x = "Shelf", y = "", subtitle = "Mean Amount")

Pivoting Data

Tidyexpalin animation by Kelsey Gonzalez

Manual Method

Consider daily rainfall observed in SLO in January 2023.

  • The data is in a human-friendly form (like a calendar).
  • Each week has a row, and each day has a column.

Data source

How would you manually convert this to long format?

Manual Method: Steps

  1. Create a new column: Day_of_Week.
  2. Create a new column: Rainfall (hold daily rainfall values).
  3. Now we have three columns (Week, Day_of_Week, and Rainfall) – start moving Sunday values over.
  4. Duplicate Week 1-5 and copy Monday values over.
  5. Duplicate Week 1-5 and copy Tuesday values over.
  6. Continue for the rest of the days of the week.
  7. You may want to arrange() by Week to get the rainfall values chronological order.

Computational Approach

We can use pivot_longer() to turn a wide dataset into a long(er) dataset.

pivot_longer()

Take a wide dataset and turn it into a long daaset.

  • cols – specify the columns that should be pivoted.
    • Do not include the names of ID columns (columns to not be pivoted).
  • names_to – the name of the new column containing the old column names.
  • values_to – the name of the new column containing the old column values.

pivot_longer()

library(readxl)
slo_rainfall <- read_xlsx("data/2023-rainfall-slo.xlsx")

slo_rainfall |> 
  mutate(across(Sunday:Saturday, as.numeric)) |> 
  pivot_longer(cols      = Sunday:Saturday,
               names_to  = "Day_of_Week",
               values_to = "Daily_Rainfall")
Week Day_of_Week Daily_Rainfall
1 Sunday 0.00
1 Monday 0.12
1 Tuesday 0.00
1 Wednesday 1.58
1 Thursday 0.91
1 Friday 0.00
1 Saturday 0.05
2 Sunday 0.27
2 Monday 4.26
2 Tuesday 0.43
2 Wednesday 0.00
2 Thursday 0.00
2 Friday 0.16
2 Saturday 1.41
3 Sunday 0.34
3 Monday 0.33
3 Tuesday 0.00
3 Wednesday 0.00
3 Thursday 0.13
3 Friday 0.00
3 Saturday 0.00
4 Sunday 0.00
4 Monday 0.00
4 Tuesday 0.00
4 Wednesday 0.00
4 Thursday 0.00
4 Friday 0.00
4 Saturday NA
5 Sunday NA
5 Monday NA
5 Tuesday NA
5 Wednesday NA
5 Thursday NA
5 Friday NA
5 Saturday NA

pivot_wider()

Take a long dataset and turn it into a wide daaset.

  • id_cols – specify the column(s) that contain the ID for unique rows in the wide dataset.
  • names_from – the name of the column containing the new column names.
  • values_from – the name of the column containing the new column values.

pivot_wider()

We calculate the mean amount of protein for cereals on each shelf and for each manufacturer.

mean_protein <- cereal |> 
  group_by(manuf, shelf) |> 
  summarize(mean_protein = mean(protein))
manuf shelf mean_protein
A 2 4.000000
G 1 3.000000
G 2 1.285714
G 3 2.666667
K 1 2.750000
K 2 2.142857
K 3 2.916667
N 1 2.666667
N 2 2.500000
N 3 4.000000
P 1 1.500000
P 2 1.000000
P 3 3.000000
Q 1 5.000000
Q 2 2.000000
Q 3 2.500000
R 1 2.000000
R 3 3.000000

pivot_wider()

mean_protein |> 
  arrange(shelf) |> 
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein)
manuf 1 2 3
G 3.000000 1.285714 2.666667
K 2.750000 2.142857 2.916667
N 2.666667 2.500000 4.000000
P 1.500000 1.000000 3.000000
Q 5.000000 2.000000 2.500000
R 2.000000 NA 3.000000
A NA 4.000000 NA

Better names in pivot_wider()

mean_protein |> 
  arrange(shelf) |> 
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein,
              names_prefix = "Shelf_")
manuf Shelf_1 Shelf_2 Shelf_3
G 3.000000 1.285714 2.666667
K 2.750000 2.142857 2.916667
N 2.666667 2.500000 4.000000
P 1.500000 1.000000 3.000000
Q 5.000000 2.000000 2.500000
R 2.000000 NA 3.000000
A NA 4.000000 NA

Data Joins

Relational Data

Multiple, interconnected tables of data are called relational.

  • It is the relation between data sets, not just the individual data sets themselves, that are important.

IMDb movie relational data

Data Joins

We can combine (join) data tables based on their relations.

Mutating joins

Add variables from a new dataframe to observations in an existing dataframe.

full_join(), left_join(), right_join(), inner_join(), outer_join()

Filtering Joins

Filter observations based on values in new dataframe.

semi_join(), anti_join()

Keys

A key uniquely identifies an observation in a data set.

  • To combine (join) two datasets, the key needs to be present in both.

inner_join()

Keeps obsertvations when their keys are present in both datasets.

inner_join(): IMDb Example

directors_genres
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
2931 Drama 0.714286
2931 Action 0.428571
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455
movies_directors
director_id movie_id
429 300229
9247 124110
11652 10920
11652 333856
14927 192017
15092 109093
15092 237431

ID: 429, 2931, 11652, 14927, 15092       ID: 429, 9247, 11652, 14927, 15092

inner_join(directors_genres, movies_directors)
director_id genre prob movie_id
429 Adventure 0.750000 300229
429 Fantasy 0.750000 300229
11652 Sci-Fi 0.500000 10920
11652 Sci-Fi 0.500000 333856
11652 Action 0.500000 10920
11652 Action 0.500000 333856
14927 Animation 1.000000 192017
14927 Family 1.000000 192017
15092 Comedy 0.545455 109093
15092 Comedy 0.545455 237431
15092 Crime 0.545455 109093
15092 Crime 0.545455 237431

ID: 429, 2931, 9247, 11652, 14927, 15092

inner_join(): IMDb Example

What if our key does not have the same name?

directors_genres
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
2931 Drama 0.714286
2931 Action 0.428571
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455
directors
id first_name last_name
429 Andrew Adamson
9247 Zach Braff
11652 James (I) Cameron
14927 Ron Clements
15092 Ethan Coen
inner_join(directors_genres, 
           directors, 
           join_by(director_id == id))
id first_name last_name genre prob
429 Andrew Adamson Adventure 0.750000
429 Andrew Adamson Fantasy 0.750000
11652 James (I) Cameron Sci-Fi 0.500000
11652 James (I) Cameron Action 0.500000
14927 Ron Clements Animation 1.000000
14927 Ron Clements Family 1.000000
15092 Ethan Coen Comedy 0.545455
15092 Ethan Coen Crime 0.545455

Join by different variables on dataX and dataY: join_by(a == b) will match dataX$a to dataY$b.

Piping Joins

Remember: the dataset you pipe in becomes the first argument of the function you are piping into!

  • So if you are using a pipe, you will only be specifying the right dataset inside the join function.
inner_join(directors_genres, movies_directors)

…is equivalent to…

directors_genres |> 
  inner_join(movies_directors)

Mutating Joins

  • left_join() – keep only (and all) observations in the left data set

  • right_join() – keep only (and all) observations in the right data set

  • full_join() – keep all observations in both data sets

Mutating Joins

Which directors would remain for each of the following?

  • left_join(directors_genres, movies_directors)
  • right_join(directors_genres, movies_directors)
  • full_join(directors_genres, movies_directors)

directors_genres |> 
  distinct(director_id)
director_id
429
2931
11652
14927
15092
movies_directors |> 
  distinct(director_id)
director_id
429
9247
11652
14927
15092

Filtering Joins: semi_join()

Keeps observations when their keys are present in both datasets, but only keeps variables from the first dataset.


→  

Filtering Joins: semi_join()

directors_genres |> 
  semi_join(movies_directors)
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455

Movie Directors: 429, 2931, 11652, 14927, 15092

directors_genres |>
  filter(director_id %in% movies_directors$director_id)
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455

Filtering Joins: anti_join()

Removes observations when their keys are present in both datasets, and only keeps variables from the first dataset.


→  


Filtering Joins: anti_join()

directors_genres |> 
  anti_join(movies_directors)
director_id genre prob
2931 Drama 0.714286
2931 Action 0.428571

Movie Directors: 429, 2931, 11652, 14927, 15092

directors_genres |>
  filter(!director_id %in% movies_directors$director_id)
director_id genre prob
2931 Drama 0.714286
2931 Action 0.428571

PA 4: Military Spending

Today you will be tidying messy data to explore the relationship between countries of the world and military spending.

  • Due Wednesday, 4/26 at 10:00am

Bonus Challenge: Murder in SQL City

For this challenge, you will be using table joins to solve a murder mystery!

  • Completing this Challenge will require the manipulation of strings – we will focus on this next week.
  • Due Monday, 5/8 at 11:59pm

To do…

  • PA 4: Military Spending
    • Due Wednesday, 4/26 at 10:00am
  • Bonus Challenge: Murder in SQL City
    • Due Monday, 5/8 at 11:59pm

Wednesday, April 26

Today we will…

  • Review Lab 3: Familiarity with AAE
  • Miscellaneous…
    • Clean Variable Names
    • Lifecycle Stages
  • Extensions to Relational Data
  • Lab 4: Avocados
  • Challenge 3: Avocado Toast Ate My Mortgage

Lab 3: Familiarity with AAE

  • The tidyverse package automatically loads ggplot2, dplyr, readr, etc. – do not load these twice!

  • Do not use mutate() + distinct(), when you can use summarize()!

Lab 3: Familiarity with AAE – Sketch it out!

For each demographic group listed below, determine all words in this study that were the most and least familiar, on average.

Clean Variable Names with janitor

Data from external sources likely has variable names not ideally formatted for R.

Names may…

  • contain spaces.
  • start with numbers.
  • start with a mix of capital and lower case letters.
names(military)[1:12]
 [1] "Country"        "Notes"          "Reporting year" "1988"          
 [5] "1989"           "1990"           "1991"           "1992"          
 [9] "1993"           "1994"           "1995"           "1996"          

Clean Variable Names with janitor

The janitor package converts all variable names in a dataset to snake_case.

Names will…

  • start with a lower case letter.
  • have spaces filled in with _.
library(janitor)
military_clean_names <- military |> 
  clean_names()

names(military_clean_names)[1:12]
 [1] "country"        "notes"          "reporting_year" "x1988"         
 [5] "x1989"          "x1990"          "x1991"          "x1992"         
 [9] "x1993"          "x1994"          "x1995"          "x1996"         

Lifceycle Stages

As packages get updated, the functions and function arguments included in those packages will change.

  • The accepted syntax for a function may change.
  • A function/functionality may disappear.

Learn more about lifecycle stages of packages, functions, function arguments in R.

Deprecated Functions

A deprecated functionality has a better alternative available and is scheduled for removal.

  • You get a warning telling you what to use instead.
military_clean |> 
  filter(across(Notes:`2019`, is.na)) |> 
  slice_head(n = 3) |> 
  select(1:8)
Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
ℹ Please use `if_any()` or `if_all()` instead.
# A tibble: 3 × 8
  Country      Notes `Reporting year` `1988` `1989` `1990` `1991` `1992`
  <chr>        <chr> <chr>            <chr>  <chr>  <chr>  <chr>  <chr> 
1 Africa       <NA>  <NA>             <NA>   <NA>   <NA>   <NA>   <NA>  
2 North Africa <NA>  <NA>             <NA>   <NA>   <NA>   <NA>   <NA>  
3 Sub-Saharan  <NA>  <NA>             <NA>   <NA>   <NA>   <NA>   <NA>  

Deprecated Functions

You should not use deprecated functions!

Instead, we use…

military_clean |>
  filter(if_all(Notes:`2019`, ~ is.na(.x))) |> 
  slice_head(n = 3) |> 
  select(1:8)
# A tibble: 3 × 8
  Country      Notes `Reporting year` `1988` `1989` `1990` `1991` `1992`
  <chr>        <chr> <chr>            <chr>  <chr>  <chr>  <chr>  <chr> 
1 Africa       <NA>  <NA>             <NA>   <NA>   <NA>   <NA>   <NA>  
2 North Africa <NA>  <NA>             <NA>   <NA>   <NA>   <NA>   <NA>  
3 Sub-Saharan  <NA>  <NA>             <NA>   <NA>   <NA>   <NA>   <NA>  

Superceded Functions

A superseded functionality has a better alternative, but is not going away .

  • This is a softer alternative to deprecation.
  • A superseded function will not give a warning (since there’s no risk if you keep using it), but the documentation will give you a recommendation.

Extensions to Relational Data

Relational Data

When we work with multiple tables of data, we say we are working with relational data.

  • It is the relations, not just the individual datasets, that are important.

When we work with relational data, we rely on keys.

  • A key uniquely identifies an observation in a dataset.
  • A key allows us to relate datasets to each other

IMDb Movies Data

How can we find each director’s active years?

Joining Multiple Data Sets

directors[1:4,]
# A tibble: 4 × 3
     id first_name last_name
  <dbl> <chr>      <chr>    
1   429 Andrew     Adamson  
2  2931 Darren     Aronofsky
3  9247 Zach       Braff    
4 11652 James (I)  Cameron  
movies_directors[1:4,]
# A tibble: 4 × 2
  director_id movie_id
        <dbl>    <dbl>
1         429   300229
2        2931   254943
3        9247   124110
4       11652    10920
movies[1:4,]
# A tibble: 4 × 4
     id name           year  rank
  <dbl> <chr>         <dbl> <dbl>
1 10920 Aliens         1986  8.20
2 17173 Animal House   1978  7.5 
3 18979 Apollo 13      1995  7.5 
4 30959 Batman Begins  2005 NA   

movies_directors |> 
  inner_join(directors, 
             join_by(director_id == id))
director_id movie_id first_name last_name
429 300229 Andrew Adamson
2931 254943 Darren Aronofsky
9247 124110 Zach Braff
11652 10920 James (I) Cameron
11652 333856 James (I) Cameron
14927 192017 Ron Clements
15092 109093 Ethan Coen
15092 237431 Ethan Coen
15093 109093 Joel Coen
15093 237431 Joel Coen
15901 130128 Francis Ford Coppola
15906 194874 Sofia Coppola
16816 350424 Cameron Crowe
17810 297838 Frank Darabont
22104 224842 Clint Eastwood
24758 112290 David Fincher
28395 46169 Mel (I) Gibson
35573 18979 Ron Howard
35838 257264 John (I) Hughes
37872 300229 Vicky Jenson
38746 238695 Mike (I) Judge
41975 314965 David Koepp
44291 17173 John (I) Landis
46315 344203 Jay Levey
48115 313459 George Lucas
56332 192017 John Musker
58201 30959 Christopher Nolan
58201 210511 Christopher Nolan
65940 111813 Rob Reiner
66849 306032 Guy Ritchie
68161 116907 Herbert (I) Ross
74758 238072 Steven Soderbergh
76524 167324 Oliver (I) Stone
78273 176711 Quentin Tarantino
78273 176712 Quentin Tarantino
78273 267038 Quentin Tarantino
78273 276217 Quentin Tarantino
82525 147603 Paul (I) Verhoeven
83616 207992 Andy Wachowski
83617 207992 Larry Wachowski
88802 256630 Unknown Director
movies_directors |> 
  inner_join(directors, 
             join_by(director_id == id)) |> 
  inner_join(movies,
             join_by(movie_id == id)) |> 
  rename(movie_name = name)
director_id movie_id first_name last_name movie_name year rank
429 300229 Andrew Adamson Shrek 2001 8.1
2931 254943 Darren Aronofsky Pi 1998 7.5
9247 124110 Zach Braff Garden State 2004 8.3
11652 10920 James (I) Cameron Aliens 1986 8.2
11652 333856 James (I) Cameron Titanic 1997 6.9
14927 192017 Ron Clements Little Mermaid, The 1989 7.3
15092 109093 Ethan Coen Fargo 1996 8.2
15092 237431 Ethan Coen O Brother, Where Art Thou? 2000 7.8
15093 109093 Joel Coen Fargo 1996 8.2
15093 237431 Joel Coen O Brother, Where Art Thou? 2000 7.8
15901 130128 Francis Ford Coppola Godfather, The 1972 9.0
15906 194874 Sofia Coppola Lost in Translation 2003 8.0
16816 350424 Cameron Crowe Vanilla Sky 2001 6.9
17810 297838 Frank Darabont Shawshank Redemption, The 1994 9.0
22104 224842 Clint Eastwood Mystic River 2003 8.1
24758 112290 David Fincher Fight Club 1999 8.5
28395 46169 Mel (I) Gibson Braveheart 1995 8.3
35573 18979 Ron Howard Apollo 13 1995 7.5
35838 257264 John (I) Hughes Planes, Trains & Automobiles 1987 7.2
37872 300229 Vicky Jenson Shrek 2001 8.1
38746 238695 Mike (I) Judge Office Space 1999 7.6
41975 314965 David Koepp Stir of Echoes 1999 7.0
44291 17173 John (I) Landis Animal House 1978 7.5
46315 344203 Jay Levey UHF 1989 6.6
48115 313459 George Lucas Star Wars 1977 8.8
56332 192017 John Musker Little Mermaid, The 1989 7.3
58201 30959 Christopher Nolan Batman Begins 2005 NA
58201 210511 Christopher Nolan Memento 2000 8.7
65940 111813 Rob Reiner Few Good Men, A 1992 7.5
66849 306032 Guy Ritchie Snatch. 2000 7.9
68161 116907 Herbert (I) Ross Footloose 1984 5.8
74758 238072 Steven Soderbergh Ocean's Eleven 2001 7.5
76524 167324 Oliver (I) Stone JFK 1991 7.8
78273 176711 Quentin Tarantino Kill Bill: Vol. 1 2003 8.4
78273 176712 Quentin Tarantino Kill Bill: Vol. 2 2004 8.2
78273 267038 Quentin Tarantino Pulp Fiction 1994 8.7
78273 276217 Quentin Tarantino Reservoir Dogs 1992 8.3
82525 147603 Paul (I) Verhoeven Hollow Man 2000 5.3
83616 207992 Andy Wachowski Matrix, The 1999 8.5
83617 207992 Larry Wachowski Matrix, The 1999 8.5
88802 256630 Unknown Director Pirates of the Caribbean 2003 NA

Joining on Multiple Variables

Consider the rodent data from Lab 2.

  • We want to add species_id to the rodent measurements.
species
genus species taxa species_id
Dipodomys merriami Rodent DM
Dipodomys ordii Rodent DO
Perognathus flavus Rodent PF
Chaetodipus penicillatus Rodent PP
Peromyscus eremicus Rodent PE
Onychomys leucogaster Rodent OL
Reithrodontomys megalotis Rodent RM
Dipodomys spectabilis Rodent DS
Onychomys torridus Rodent OT
Neotoma albigula Rodent NL
Peromyscus maniculatus Rodent PM
Sigmodon hispidus Rodent SH
Reithrodontomys fulvescens Rodent RF
Chaetodipus baileyi Rodent PB
measurements
genus_name species sex hindfoot_length weight
Dipodomys merriami M 35 40
Dipodomys merriami M 37 48
Dipodomys merriami F 34 29
Dipodomys merriami F 35 46
Dipodomys merriami M 35 36
Dipodomys ordii F 32 52
Perognathus flavus M 15 8
Dipodomys merriami F 36 35
Perognathus flavus M 12 7
Dipodomys merriami F 32 22
Perognathus flavus M 16 9
Dipodomys merriami F 34 42
Perognathus flavus F 14 8
Dipodomys merriami F 35 41
Dipodomys merriami F 37 37
Dipodomys merriami F 35 43
Dipodomys merriami F 35 41
Dipodomys merriami F 33 40
Perognathus flavus F 11 9
Dipodomys merriami F 35 45
Chaetodipus penicillatus F 20 15
Dipodomys merriami M 35 29
Dipodomys merriami M 35 39
Dipodomys merriami F 36 43
Dipodomys merriami M 38 46
Dipodomys merriami M 36 41
Dipodomys merriami M 36 41
Dipodomys merriami M 38 40
Dipodomys merriami M 37 45
Dipodomys merriami F 35 46
Dipodomys merriami F 35 40
Dipodomys merriami F 35 30
Dipodomys merriami M 35 39
Dipodomys merriami M 35 34
Dipodomys merriami F 37 42
Dipodomys merriami M 37 42
Perognathus flavus F 13 8
Dipodomys merriami F 37 31
Dipodomys merriami F 36 40
Dipodomys merriami M 36 37
Dipodomys merriami M 36 48
Dipodomys merriami M 37 42
Dipodomys merriami F 39 45
Chaetodipus penicillatus F 21 16
Dipodomys merriami F 36 36
Dipodomys merriami M 36 42
Dipodomys merriami M 36 44
Dipodomys merriami F 36 41
Dipodomys merriami F 36 40
Dipodomys merriami M 37 34
Dipodomys merriami M 33 40
Dipodomys merriami M 33 44
Dipodomys merriami M 37 44
Dipodomys merriami M 34 36
Dipodomys merriami M 35 33
Dipodomys merriami F 37 46
Dipodomys merriami F 34 35
Dipodomys merriami M 36 46
Dipodomys merriami F 33 37
Dipodomys merriami M 36 34
Dipodomys merriami F 36 45
Perognathus flavus F 15 7
Dipodomys merriami M 37 51
Dipodomys merriami M 35 39
Dipodomys merriami M 36 29
Dipodomys merriami F 32 48
Dipodomys merriami M 38 46
Dipodomys merriami F 37 41
Dipodomys merriami M 37 45
Dipodomys merriami F 35 42
Dipodomys merriami F 36 53
Dipodomys merriami F 35 49
Dipodomys merriami F 36 46
Perognathus flavus F 13 9
Chaetodipus penicillatus F 19 15
Perognathus flavus M 13 4
Dipodomys merriami M 36 48
Dipodomys merriami M 37 51
Dipodomys merriami M 38 50
Dipodomys merriami M 35 44
Dipodomys merriami M 25 44
Dipodomys merriami M 35 45
Dipodomys merriami F 37 45
Peromyscus eremicus M 20 19
Dipodomys merriami F 38 44
Dipodomys merriami F 36 42
Dipodomys merriami M 37 39
Dipodomys merriami M 37 47
Dipodomys merriami M 36 42
Dipodomys merriami M 36 49
Dipodomys merriami M 38 39
Dipodomys merriami F 36 43
Dipodomys merriami M 35 50
Dipodomys merriami M 36 41
Dipodomys merriami M 37 47
Dipodomys merriami F 36 37
Dipodomys merriami M 36 41
Dipodomys merriami F 36 36
Dipodomys merriami M 36 45
Peromyscus eremicus M 19 20
species |> 
  full_join(measurements,
            join_by(species == species, 
                    genus == genus_name))
genus species taxa species_id sex hindfoot_length weight
Dipodomys merriami Rodent DM M 35 40
Dipodomys merriami Rodent DM M 37 48
Dipodomys merriami Rodent DM F 34 29
Dipodomys merriami Rodent DM F 35 46
Dipodomys merriami Rodent DM M 35 36
Dipodomys merriami Rodent DM F 36 35
Dipodomys merriami Rodent DM F 32 22
Dipodomys merriami Rodent DM F 34 42
Dipodomys merriami Rodent DM F 35 41
Dipodomys merriami Rodent DM F 37 37
Dipodomys merriami Rodent DM F 35 43
Dipodomys merriami Rodent DM F 35 41
Dipodomys merriami Rodent DM F 33 40
Dipodomys merriami Rodent DM F 35 45
Dipodomys merriami Rodent DM M 35 29
Dipodomys merriami Rodent DM M 35 39
Dipodomys merriami Rodent DM F 36 43
Dipodomys merriami Rodent DM M 38 46
Dipodomys merriami Rodent DM M 36 41
Dipodomys merriami Rodent DM M 36 41
Dipodomys merriami Rodent DM M 38 40
Dipodomys merriami Rodent DM M 37 45
Dipodomys merriami Rodent DM F 35 46
Dipodomys merriami Rodent DM F 35 40
Dipodomys merriami Rodent DM F 35 30
Dipodomys merriami Rodent DM M 35 39
Dipodomys merriami Rodent DM M 35 34
Dipodomys merriami Rodent DM F 37 42
Dipodomys merriami Rodent DM M 37 42
Dipodomys merriami Rodent DM F 37 31
Dipodomys merriami Rodent DM F 36 40
Dipodomys merriami Rodent DM M 36 37
Dipodomys merriami Rodent DM M 36 48
Dipodomys merriami Rodent DM M 37 42
Dipodomys merriami Rodent DM F 39 45
Dipodomys merriami Rodent DM F 36 36
Dipodomys merriami Rodent DM M 36 42
Dipodomys merriami Rodent DM M 36 44
Dipodomys merriami Rodent DM F 36 41
Dipodomys merriami Rodent DM F 36 40
Dipodomys merriami Rodent DM M 37 34
Dipodomys merriami Rodent DM M 33 40
Dipodomys merriami Rodent DM M 33 44
Dipodomys merriami Rodent DM M 37 44
Dipodomys merriami Rodent DM M 34 36
Dipodomys merriami Rodent DM M 35 33
Dipodomys merriami Rodent DM F 37 46
Dipodomys merriami Rodent DM F 34 35
Dipodomys merriami Rodent DM M 36 46
Dipodomys merriami Rodent DM F 33 37
Dipodomys merriami Rodent DM M 36 34
Dipodomys merriami Rodent DM F 36 45
Dipodomys merriami Rodent DM M 37 51
Dipodomys merriami Rodent DM M 35 39
Dipodomys merriami Rodent DM M 36 29
Dipodomys merriami Rodent DM F 32 48
Dipodomys merriami Rodent DM M 38 46
Dipodomys merriami Rodent DM F 37 41
Dipodomys merriami Rodent DM M 37 45
Dipodomys merriami Rodent DM F 35 42
Dipodomys merriami Rodent DM F 36 53
Dipodomys merriami Rodent DM F 35 49
Dipodomys merriami Rodent DM F 36 46
Dipodomys merriami Rodent DM M 36 48
Dipodomys merriami Rodent DM M 37 51
Dipodomys merriami Rodent DM M 38 50
Dipodomys merriami Rodent DM M 35 44
Dipodomys merriami Rodent DM M 25 44
Dipodomys merriami Rodent DM M 35 45
Dipodomys merriami Rodent DM F 37 45
Dipodomys merriami Rodent DM F 38 44
Dipodomys merriami Rodent DM F 36 42
Dipodomys merriami Rodent DM M 37 39
Dipodomys merriami Rodent DM M 37 47
Dipodomys merriami Rodent DM M 36 42
Dipodomys merriami Rodent DM M 36 49
Dipodomys merriami Rodent DM M 38 39
Dipodomys merriami Rodent DM F 36 43
Dipodomys merriami Rodent DM M 35 50
Dipodomys merriami Rodent DM M 36 41
Dipodomys merriami Rodent DM M 37 47
Dipodomys merriami Rodent DM F 36 37
Dipodomys merriami Rodent DM M 36 41
Dipodomys merriami Rodent DM F 36 36
Dipodomys merriami Rodent DM M 36 45
Dipodomys ordii Rodent DO F 32 52
Perognathus flavus Rodent PF M 15 8
Perognathus flavus Rodent PF M 12 7
Perognathus flavus Rodent PF M 16 9
Perognathus flavus Rodent PF F 14 8
Perognathus flavus Rodent PF F 11 9
Perognathus flavus Rodent PF F 13 8
Perognathus flavus Rodent PF F 15 7
Perognathus flavus Rodent PF F 13 9
Perognathus flavus Rodent PF M 13 4
Chaetodipus penicillatus Rodent PP F 20 15
Chaetodipus penicillatus Rodent PP F 21 16
Chaetodipus penicillatus Rodent PP F 19 15
Peromyscus eremicus Rodent PE M 20 19
Peromyscus eremicus Rodent PE M 19 20
Onychomys leucogaster Rodent OL NA NA NA
Reithrodontomys megalotis Rodent RM NA NA NA
Dipodomys spectabilis Rodent DS NA NA NA
Onychomys torridus Rodent OT NA NA NA
Neotoma albigula Rodent NL NA NA NA
Peromyscus maniculatus Rodent PM NA NA NA
Sigmodon hispidus Rodent SH NA NA NA
Reithrodontomys fulvescens Rodent RF NA NA NA
Chaetodipus baileyi Rodent PB NA NA NA

Lab + Challenge

Lab 4: Avocados + Challenge 4: Avocado Toast Ate My Mortgage


Handy Helpers

rename() – Change names of columns

separate() – Separate values of a variable


Filtering Joins

semi_join(): Keeps values found in another data set

anti_join(): Keeps values not found in another data set

Workflow

  1. Load packages + read in data.
library(tidyverse)
data_original <- read_csv(file = "path/to/datal.csv")
  1. Clean data and save changes – use this data from now on!
data_clean <- data_original |> 
  mutate(across(x1:x5, ~ as.factor(.x)),
         new_var = if_else(...))
  1. Create subsets from your cleaned data for specific tasks.
demographics_subj <- data_clean |> 
  distinct(subj, keep_all = TRUE)
  1. Output only the the information you want in your report.

To do…

  • Lab 4: Avocados
    • Due Friday, 4/28 at 11:59pm
  • Challenge 4: Avocado Toast Ate My Mortgage
    • Due Saturday, 4/29 at 11:59pm
  • Read Chapter 5: Special Data Types
    • Check-in 5.1 - 5.3 due Monday (5/1) at 10:00am
  • Bonus Challenge: Murder Mystery in SQL City
    • Due Monday 5/8 at 11:59pm